---
title: "R crash course exercise"
output: 
  html_notebook: 
    css: stylesheets/styles.css
---

# Part1

This exercise concerns the clinical descriptions of tumours from The Cancer Genome Atlas (TCGA). The data were previously downloaded from [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944) and have undergone some minor alterations. See the script [process_tcga_clinical.R](/process_tcga_clinical.R).

The data are provided as the file `tcga_clinical.tsv` in the `raw_data` directory of the `r_crash_course.zip` file.

<div class="exercise">
**Exercise**: What function from `readr` would you use to read the file `tcga_clinical.tsv` into R? Read the file in. How many rows and columns does it contain?

</div>

Since the input file is "tab-separated", we need to use `read_tsv`. Before we can use the function we need to load the `readr` package. The message printed by `read_tsv` reports the dimensions of the data: 7706 rows and 420 columns.

```{r}
### Remember that every time we want to use a function from a particular package, that package needs to be loaded from our library

library(readr)
data <- read_tsv("raw_data/tcga_clinical.tsv")
data
```
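
The dimensions can also be queried directly rather than read off the printed message. The sketch below uses a small made-up table written to a temporary file (the names `toy`, `toy_file` and `toy_in` are hypothetical and not part of the exercise data); the same calls should apply to the real `data` object.

```{r}
library(readr)

# A tiny made-up table standing in for tcga_clinical.tsv
toy <- data.frame(
  patient = c("P1", "P2", "P3"),
  age     = c(61, 47, 58)
)
toy_file <- tempfile(fileext = ".tsv")
write_tsv(toy, toy_file)

# show_col_types = FALSE silences the column specification message
toy_in <- read_tsv(toy_file, show_col_types = FALSE)

nrow(toy_in)  # number of rows
ncol(toy_in)  # number of columns
dim(toy_in)   # both at once: rows first, then columns
```

Running `dim(data)` on the real data should agree with the `Rows:` and `Columns:` values that `read_tsv` reports.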


You should find that the data frame contains a great many columns; far too many to be useful. We would like to keep the columns containing the age of the patient and the tumour stage in our analysis. Rather than opening the file, or `View`ing it in RStudio, we can use a couple of helper functions to identify the relevant column names.

<div class="exercise">
**Exercise**: Use the `select` function in conjunction with `contains` and `starts_with` to identify columns that have Age or Stage information in their name. The code should look like the following (you will need to fill in the dots).

</div>

The functions `contains` and `starts_with` perform similar operations when used to select columns from a data frame. To use either, and to use the `select` function, we first have to load the `dplyr` package. The `contains` function identifies all columns that have a particular text pattern anywhere in their name. If we wanted only the columns about age, the following wouldn't be a good choice, as it would also pick up every column with "stage" in its name.

```{r}
library(dplyr)
select(data, contains("age"))
```
But if we wanted all the columns regarding "stage", `contains` would be a good choice.

```{r}
select(data, contains("stage"))
```
Since the age-related columns start with "age", we can use the `starts_with` function instead.

```{r}
select(data, starts_with("age"))
```
```{r}
# Alternatively, keep contains("age") and explicitly drop the unwanted matches
select(data, contains("age"), 
       -contains("stage"), 
       -contains("agent"),
       -contains("heritage"),
       -contains("percentage"))
```
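
The difference between the helpers is easier to see on a table with only a few columns. This is a minimal sketch on a made-up one-row tibble; `toy_cols` and its column names are invented for illustration and only mimic the kind of name clashes found in the real data.

```{r}
library(dplyr)

# Made-up column names, chosen to mimic the naming clashes in the real data
toy_cols <- tibble(
  age_at_diagnosis = 61,
  pathologic_stage = "Stage II",
  percentage_tumor = 80,
  gender           = "FEMALE"
)

# contains() matches the text anywhere in the name
names(select(toy_cols, contains("age")))

# starts_with() only matches at the beginning of the name
names(select(toy_cols, starts_with("age")))

# A leading minus removes columns that an earlier helper matched
names(select(toy_cols, contains("age"), -contains("stage"), -contains("percentage")))
```

Wrapping the result in `names()` is a convenient way to inspect which columns were matched without printing the data themselves.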


<div class="exercise">
**Exercise:** Use the `select` function to create a new data frame that contains the following columns. **These are not the actual column names**

  - Tumour site
  - Race
  - Gender
  - Age at diagnosis
  - Dead / Alive Status

You can add extra columns if you wish.

**See below for example output**
</div>

In this case we can list the columns that we want to select. The code below makes use of the pipe ` %>% ` which allows the output from one line of code to be used as an input in the next line.

```{r message=FALSE}
data <- read_tsv("raw_data/tcga_clinical.tsv") %>% 
  select(tumor_tissue_site,
         race,
         gender,
         age_at_initial_pathologic_diagnosis,
         vital_status)
data
```
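
To make the behaviour of the pipe concrete, the sketch below writes the same operation twice, once as a nested call and once with `%>%`. The tibble `toy_pipe` is made up for illustration.

```{r}
library(dplyr)

toy_pipe <- tibble(site = c("Breast", "Lung"), age = c(61, 47))

# Nested call: has to be read from the inside out
nested <- names(select(toy_pipe, site))

# Piped call: read from top to bottom; each result becomes the
# first argument of the next function
piped <- toy_pipe %>% 
  select(site) %>% 
  names()

# Both forms give the same answer
identical(nested, piped)
```

The piped form tends to be easier to follow once more than two steps are chained together.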



<div class="exercise">
**Exercise:** Use the `dplyr` function called `count` to tabulate which tumour sites are included in the data. Re-arrange the output from `count` using `arrange` to determine the most common type of cancer in the dataset.

**See below for example output**
</div>

The `count` function takes a data frame as its first argument, followed by the name of the column that we want to obtain counts for.

By default, `count` orders the results by the values of the counted column (here, alphabetically by site), not by the counts themselves.

```{r}
count(data, tumor_tissue_site)
```
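
The default ordering is easier to see on a small example. This sketch uses a made-up tibble (`toy_sites` is a hypothetical name) that also contains a missing value.

```{r}
library(dplyr)

toy_sites <- tibble(site = c("Lung", "Breast", NA, "Breast", "Lung", "Lung"))

# Rows come back ordered by the values of `site`, not by `n`;
# missing values are counted too and appear as their own NA row
toy_counts <- count(toy_sites, site)
toy_counts
```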
For our particular use case we want the tissue types ordered by *descending* count. Fortunately, the `dplyr` function `arrange`, combined with `desc`, will do this. We can also take advantage of the pipe to chain the steps together.

```{r }
count(data, tumor_tissue_site) %>% 
  arrange(desc(n))
```
This reveals a problem: the most common entry in the column is `NA`, a special value in R representing a lack of data. Missing data, and the various ways used to represent it, is a common issue in data analysis. We can also see that the term `[Not Available]` is sometimes used.
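
One option, not used in the rest of this exercise, is to tell `read_tsv` at import time which strings should be treated as missing, via its `na` argument. The sketch below is a minimal illustration on a made-up temporary file; `toy_na_file` and `toy_na` are hypothetical names.

```{r}
library(readr)

# A made-up file that mixes the two missing-value markers
toy_na_file <- tempfile(fileext = ".tsv")
writeLines(
  c("patient\tsite", "P1\tBreast", "P2\t[Not Available]", "P3\tNA"),
  toy_na_file
)

# `na` lists every string that should be read as a missing value
toy_na <- read_tsv(
  toy_na_file,
  na = c("", "NA", "[Not Available]"),
  show_col_types = FALSE
)

# Both markers are now proper NA values
is.na(toy_na$site)
```

With this approach a single `is.na` check would cover both representations, at the cost of no longer being able to tell them apart.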

<div class="exercise">
**Exercise**: Not all samples have an entry for tumour type. Use the `filter` function to create a table with valid entries for `tumor_tissue_site`. Create a barplot (`geom_bar`) or column plot (`geom_col`) to display the number of occurrences of each tumour type.

HINT: An easy way to make the labels on the x-axis more legible is to use the `coord_flip` function.

```{r eval=FALSE}
ggplot(data, aes(x=...)) + geom_bar() + coord_flip()
```

</div>

We can use the `filter` function to exclude the `[Not Available]` entries from our data. The `==` sign can be used to identify entries that *are* equal to `[Not Available]`, whereas `!=` can be used to identify entries that are *not* equal to `[Not Available]`.

```{r}
filter(data, tumor_tissue_site == "[Not Available]")
```
```{r}
filter(data, tumor_tissue_site != "[Not Available]")
```

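Removing the `NA` entries is a bit more tricky, as neither of the lines of code below works as you might expect. Comparing anything to `NA` gives `NA` rather than `TRUE`, so the first line returns no rows; the second looks for the literal text "NA", which is not the same thing as a missing value.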

```{r}
filter(data, tumor_tissue_site == NA)
filter(data, tumor_tissue_site == "NA")
```

We need to use a special function, `is.na`, to identify `NA` entries. As above, negating it with `!is.na` identifies the rows that do not contain an `NA`.

```{r}
filter(data, !is.na(tumor_tissue_site))
```
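
The behaviour of `NA` in comparisons can be checked directly on a small made-up tibble (`toy_missing` is a hypothetical name). Note the last step: because `!=` also yields `NA` for a missing entry, that filter drops the `NA` row as well, although using `!is.na` makes the intention explicit.

```{r}
library(dplyr)

toy_missing <- tibble(site = c("Breast", NA, "[Not Available]"))

# Any comparison with NA is itself NA, never TRUE or FALSE
toy_missing$site == NA

# filter() keeps only rows where the condition is TRUE, so nothing survives
nrow(filter(toy_missing, site == NA))

# is.na() returns a proper TRUE / FALSE for every entry
is.na(toy_missing$site)

# != gives NA for the missing row, so that row is dropped here too
filter(toy_missing, site != "[Not Available]")
```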
Putting everything together, we get the following R code. We use the `geom_col` function which requires both the `x` and `y` aesthetics.

```{r}
library(ggplot2)

# Drop both kinds of missing entry, count each site, then plot the counts
filter(data, !is.na(tumor_tissue_site)) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  count(tumor_tissue_site) %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(x = tumor_tissue_site, y = n)) + geom_col() + coord_flip()
```

Strictly speaking, we don't actually need the `count` step, as the counts required for the plot will be generated automatically if we use `geom_bar` instead.

```{r}
  filter(data,!is.na(tumor_tissue_site)) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  ggplot(aes(x = tumor_tissue_site)) + geom_bar() + coord_flip()
```
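
One way to check that the two approaches draw the same bars is to inspect the values `ggplot2` computes for each plot with `layer_data`. This is a sketch on a made-up tibble; `toy_bar`, `p_bar` and `p_col` are hypothetical names.

```{r}
library(dplyr)
library(ggplot2)

toy_bar <- tibble(site = c("Lung", "Breast", "Breast", "Lung", "Lung"))

# geom_bar() counts the rows for each x value itself
p_bar <- ggplot(toy_bar, aes(x = site)) + geom_bar()

# geom_col() plots a y value that we computed beforehand
p_col <- ggplot(count(toy_bar, site), aes(x = site, y = n)) + geom_col()

# layer_data() returns the values ggplot2 uses to draw a layer
layer_data(p_bar)$count
layer_data(p_col)$y
```

Both calls should report the same bar heights, in the same alphabetical order of `site`.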

You might be wondering why the bars are ordered alphabetically rather than by count. `ggplot2` treats a text column such as `tumor_tissue_site` like a *factor*, and the default ordering of factor levels in R is alphabetical.

The `forcats` package (which is part of `tidyverse`) can be used if we want a different order to be shown.

- [forcats package description](https://forcats.tidyverse.org/)

Rather than the x axis being mapped to `tumor_tissue_site`, it can be mapped to a re-ordered version of the factor.

```{r eval=FALSE}
...aes(x=fct_reorder(tumor_tissue_site,n))...
```


```{r}
library(forcats)
  filter(data,!is.na(tumor_tissue_site)) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  count(tumor_tissue_site) %>% 
  arrange(desc(n)) %>% 
  ggplot(aes(x = fct_reorder(tumor_tissue_site,n),y=n)) + geom_col() + coord_flip()
```
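
The effect of `fct_reorder` on the level order can be inspected without drawing a plot. The sketch below uses made-up vectors (`toy_site` and `toy_n` are hypothetical names and the counts are invented).

```{r}
library(forcats)

toy_site <- c("Breast", "Lung", "Kidney")
toy_n    <- c(30, 10, 20)

# Without reordering, the levels are alphabetical
levels(factor(toy_site))

# fct_reorder() sorts the levels by the second argument, smallest first
levels(fct_reorder(toy_site, toy_n))
```

With `coord_flip`, the first level is drawn at the bottom of the axis, so ordering from smallest to largest puts the most common site at the top. Setting `.desc = TRUE` in `fct_reorder` reverses the order.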

